2 Quantization of Neural Networks

Quantization is a strategy that has demonstrated outstanding and consistent success in both the training and inference of neural networks (NNs). Even though the issues of numerical representation and quantization are as old as digital computing, NNs present unique opportunities for advancement. Although most of this quantization survey is concerned with inference, it is essential to note that quantization has also been successful in NN training [8, 42, 63, 105]. In particular, innovations in half-precision and mixed-precision training [47, 80] have enabled greater throughput on AI accelerators. However, going below half-precision without significant tuning has proven challenging, and most recent quantization research has concentrated on inference.

2.1 Overview of Quantization

Given an NN model of $N$ layers, we denote its weight set as $\mathbf{W} = \{\mathbf{w}^n\}_{n=1}^{N}$ and the input feature set as $\mathbf{A} = \{\mathbf{a}_{in}^n\}_{n=1}^{N}$. Here, $\mathbf{w}^n \in \mathbb{R}^{C_{out}^n \times C_{in}^n}$ and $\mathbf{a}_{in}^n \in \mathbb{R}^{C_{in}^n}$ are the convolutional weight and the input feature map of the $n$-th layer, respectively, where $C_{in}^n$ and $C_{out}^n$ respectively stand for the number of input channels and the number of output channels. Then, the outputs $\mathbf{a}_{out}^n$ can be formulated as:

$$\mathbf{a}_{out}^n = \mathbf{w}^n \cdot \mathbf{a}_{in}^n, \tag{2.1}$$

where $\cdot$ represents matrix multiplication. In this paper, we omit the non-linear function for simplicity. Following prior works [100], a quantized neural network (QNN) intends to represent $\mathbf{w}^n$ and $\mathbf{a}^n$ in a low-bit format as

$$\mathbb{Q} := \{q_1, \cdots, q_U\},$$

where the $q_i$, $i = 1, \cdots, U$, satisfying $q_1 < \cdots < q_U$, are defined as the quantized values of the original variable $x$. Note that $x$ can be the input feature $\mathbf{a}^n$ or the weights $\mathbf{w}^n$. In this way, $\mathbf{q}_{\mathbf{w}^n} \in \mathbb{Q}^{C_{out}^n \times C_{in}^n}$ and $\mathbf{q}_{\mathbf{a}_{in}^n} \in \mathbb{Q}^{C_{in}^n}$, such that the floating-point convolutional outputs can be approximated by the efficient XNOR and bit-count instructions as:

$$\mathbf{a}_{out}^n \approx \mathbf{q}_{\mathbf{w}^n} \odot \mathbf{q}_{\mathbf{a}_{in}^n}. \tag{2.2}$$
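To make Eq. (2.2) concrete, below is a minimal NumPy sketch of the extreme binary case $\mathbb{Q} = \{-1, +1\}$: quantized weights and activations are stored as 0/1 bits, and their dot product is recovered from XNOR and bit-count alone. The function names and the 64-element toy vectors are our own illustration; production kernels pack the bits into machine words and use hardware popcount instructions.

```python
import numpy as np

def binarize(x):
    # Sign quantizer: maps each real value onto Q = {-1, +1}.
    return np.where(x >= 0, 1, -1).astype(np.int8)

def xnor_dot(bits_w, bits_a):
    # Dot product of two {-1, +1} vectors stored as 0/1 bits:
    # matches - mismatches = 2 * popcount(XNOR) - n.
    n = bits_w.size
    matches = int((~(bits_w ^ bits_a) & 1).sum())  # popcount of XNOR
    return 2 * matches - n

rng = np.random.default_rng(0)
w = rng.standard_normal(64)   # one row of w^n (toy example)
a = rng.standard_normal(64)   # the input feature a^n_in

qw, qa = binarize(w), binarize(a)
bits_w = ((qw + 1) // 2).astype(np.uint8)  # map -1/+1 to 0/1
bits_a = ((qa + 1) // 2).astype(np.uint8)

# The XNOR/bit-count result equals the integer dot product of qw and qa.
assert xnor_dot(bits_w, bits_a) == int(qw.astype(np.int32) @ qa.astype(np.int32))
```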

The core question of QNNs is how to define the quantization set $\mathbb{Q}$, which is described next.

2.1.1 Uniform and Non-Uniform Quantization

First, we must define a function capable of quantizing the weights and activations of the NN to a finite set of values. The following is a popular choice for a quantization function:

$$Q(x) = \mathrm{INT}\left(\frac{x}{S}\right) - Z, \tag{2.3}$$
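As an illustration of Eq. (2.3), the sketch below implements this uniform quantizer in NumPy, reading $S$ as a real-valued scaling factor, $Z$ as an integer zero point, and $\mathrm{INT}(\cdot)$ as round-to-nearest; the clipping range and the example values are assumptions of this sketch rather than part of the equation.

```python
import numpy as np

def uniform_quantize(x, scale, zero_point, num_bits=8):
    # Eq. (2.3): Q(x) = INT(x / S) - Z, with INT(.) as round-to-nearest.
    q = np.round(x / scale) - zero_point
    # Clip to the signed num_bits integer grid (an assumption of this sketch).
    qmin, qmax = -(2 ** (num_bits - 1)), 2 ** (num_bits - 1) - 1
    return np.clip(q, qmin, qmax).astype(np.int32)

def dequantize(q, scale, zero_point):
    # Approximate recovery of the real value: x_hat = S * (Q(x) + Z).
    return scale * (q.astype(np.float32) + zero_point)

x = np.array([-1.0, -0.25, 0.0, 0.4, 1.0], dtype=np.float32)
scale, zero_point = 1.0 / 127, 0   # symmetric 8-bit example (assumed)
q = uniform_quantize(x, scale, zero_point)
x_hat = dequantize(q, scale, zero_point)  # error |x - x_hat| <= S/2 within range
```

Because the representable values $S \cdot (q + Z)$ are evenly spaced, this scheme is referred to as uniform quantization; non-uniform schemes instead place the $q_i$ on an unequally spaced grid.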

